Quantifying diversity is of central importance for the study of structure, function and
evolution of microbial communities. The estimation of microbial diversity has received
renewed attention with the advent of large-scale metagenomic studies. Here, we consider
what the diversity observed in a sample tells us about the diversity of the community being
sampled. First, we argue that one cannot reliably estimate the absolute and relative number
of microbial species present in a community without making unsupported assumptions about
species abundance distributions. The reason for this is that sample data do not contain
information about the number of rare species in the tail of species abundance distributions.
We illustrate the difficulty in comparing species richness estimates by applying Chao's
estimator of species richness to a set of in silico communities: they are ranked incorrectly
in the presence of large numbers of rare species. Next, we extend our analysis to a general
family of diversity metrics ("Hill diversities"), and construct lower and upper estimates of
diversity values consistent with the sample data. The theory generalizes Chao's estimator,
which we retrieve as the lower estimate of species richness. We show that Shannon and
Simpson diversity can be robustly estimated for the in silico communities. We analyze nine
metagenomic data sets from a wide range of environments, and show that our findings are
relevant for empirically-sampled communities. Hence, we recommend the use of Shannon and
Simpson diversity rather than species richness in efforts to quantify and compare microbial
diversity.

Accepted paper in press at The ISME Journal; doi:10.1038/ismej.2013.10 

Subject category: Microbial population and community ecology 

Keywords: Chao estimator; Hill diversities; metagenomics; Shannon diversity; 
Simpson diversity; species abundance distribution 






Introduction 



Species diversity is a crucial property of ecological communities: it is the primary descriptor of
community structure, and it is generally believed to be a major determinant of the functioning
and the dynamics of ecological communities (Wilson 1999; Loreau et al. 2001; Ives and
Carpenter 2007; Loreau 2010). Therefore, diversity measurement is often a first step in
characterizing an ecological community (Brose et al. 2003; Magurran 2004; Gotelli and
Colwell 2011). Because an exhaustive census of the community is usually not feasible,
community diversity must be inferred from the diversity observed in a sample taken from
the community. The inference problem can be difficult, especially when community diversity
is believed to be very large (Engen 1978; Bunge and Fitzpatrick 1993; Mao and Colwell
2005).

Diversity measurement is particularly challenging for microbial communities (Hughes et al.).
It should be recalled that there is no unambiguous way to define microbial "species"
(Stackebrandt et al. 2002). Here we use the term species pragmatically to mean an
operationally determined taxonomic unit (e.g., 97% identity of 16S rRNA (Schloss and
Handelsman 2005)). However measured, the species diversity of microbial communities is
usually much larger than that of communities of larger organisms. Moreover, the number of
organisms in microbial communities is typically many orders of magnitude larger than the
number of organisms in plant or animal communities (Whitman et al. 1998). This leads to
severe sampling problems. Although metagenomic approaches allow for impressively large
sample sizes (Huber et al. 2007; Roesch et al. 2007; Rusch et al. 2007), even these huge
samples correspond to a tiny fraction of the community being sampled. Hence, for microbial
community samples, community diversity is generally much larger than sample diversity. This
disparity between community and sample leads to a challenge that we address here: how can
microbial diversity be estimated robustly?

One popular approach to circumvent the sampling problem is to assume that the species
abundance distribution of the community belongs to a specific family (for example, the
family of lognormal distributions) (Curtis et al. 2002; Hong et al. 2006; Schloss and
Handelsman 2006; Quince et al. 2008). Such an assumption fills in the information about the
community missing in the data and leads to precise diversity estimates. But the validity of
the estimates depends crucially on the choice of the species abundance distribution family.
This choice cannot be verified empirically because the sample data do not contain sufficient
information about the community structure. In fact, many distribution families yield
extrapolated community structures that are consistent with the sample data. Here we show
that the extrapolation approach has intrinsic limitations.

Other methods for diversity estimation have been proposed. For example, proposals have
been made to extrapolate the rarefaction curve beyond the actual sample size (Gotelli and
Colwell 2001; Colwell et al. 2004), or to assume a particular distribution for the community
diversity over taxonomic levels (May 1988; Mora et al. 2011). Ultimately, these methods too
are limited by the lack of information about the community structure in the sample data.
Rather than filling this gap by unverifiable assumptions, here we ask what can (and cannot)
be inferred from the sample data alone. An interesting step in this direction is given by the
popular Chao estimator (Chao 1984; Shen et al. 2003; Chao et al. 2009). Chao's estimate
can be interpreted as a lower estimate of the species richness consistent with the data.
We take the estimation strategy underlying Chao's estimator a step further, and construct
lower and upper estimates for a general family of community diversities, including species
richness, Shannon diversity and Simpson diversity (Hill 1973). The unification we propose
here represents a robust approach to estimating microbial diversity in theory and in practice.

Materials and methods



Data sets 



The data sets used in this paper were downloaded from the supplementary material of Quince
et al. (2008). The abundance data used in Figure 1 correspond to 16S rDNA sequences
obtained from a bacterial soil community (sample "Brazil" in Roesch et al. (2007)). The
abundance data used in Figure 5 correspond to 16S rDNA sequences obtained from a bacterial
seawater community from the upper ocean (Rusch et al. 2007), from four bacterial soil
communities (Roesch et al. 2007), and from bacterial and archaeal seawater communities
from two hydrothermal vents (Huber et al. 2007).



Rank-abundance curves 



We represent the species abundance distribution of a community as a rank-abundance curve,
that is, we arrange the species in decreasing order of community abundance, and plot species
abundance as a function of species rank. We use logarithmic scales for both axes of the
rank-abundance curves, so that a community with power-law abundance distribution is
represented as a straight line (the slope is equal to the power-law exponent), see Figure 2A.
We constructed the communities of Figure 1 by using a piecewise linear parametrization of
the rank-abundance curve. Hence, the species abundance distributions consist of power-law
segments with different exponents.
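
The piecewise parametrization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `piecewise_power_law`, the breakpoints and the exponents are hypothetical and are not the values used for Figure 1. Each segment has abundance proportional to rank^(-z), and consecutive segments are joined continuously at the breakpoints.

```python
import numpy as np

def piecewise_power_law(breaks, exponents):
    """Relative abundances for ranks 1..breaks[-1] (hypothetical helper).

    breaks    -- increasing ranks at which each power-law segment ends
    exponents -- exponent z of each segment (abundance ~ rank**(-z))
    """
    ranks = np.arange(1, breaks[-1] + 1, dtype=float)
    abundance = np.empty_like(ranks)
    anchor_rank, anchor_value, first = 1.0, 1.0, 0
    for end, z in zip(breaks, exponents):
        segment = ranks[first:end]
        # Straight line on log-log axes, passing through the anchor point.
        abundance[first:end] = anchor_value * (segment / anchor_rank) ** (-z)
        anchor_rank, anchor_value, first = float(end), abundance[end - 1], end
    # Normalize so that the relative abundances sum to one.
    return abundance / abundance.sum()

# Illustrative community: exponent 1 up to rank 100, exponent 2 up to rank 10000.
p = piecewise_power_law([100, 10_000], [1.0, 2.0])
```

On logarithmic axes, `p` plotted against rank shows two straight segments that meet at rank 100.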



Rarefaction curves 



We define $S_m$ as the expected number of species in a sample of $m$ individuals taken from the
community (sampling with replacement). The rarefaction curve of the community is the plot
of the number of species $S_m$ as a function of the sample size $m$. It is important to distinguish
the community rarefaction curve from the rarefaction curve estimated from sample data. For
a sample of size $M$ taken from the community, the part of the rarefaction curve corresponding
to $S_m$ with $m < M$ can be estimated by subsampling the sample data. The same approach
fails for the part of the rarefaction curve corresponding to $S_m$ with $m > M$. In that case
the rarefaction curve has to be extrapolated, introducing large estimation uncertainty. We
studied two extreme extrapolation scenarios: one for the slowest (i.e., smallest slope) and
one for the fastest (i.e., largest slope) increase of the rarefaction curve compatible with the
sample data, see Figure 3.
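
When the community abundances are known, the community rarefaction curve defined above has a closed form under sampling with replacement: a species with relative abundance p is missed by all m draws with probability (1 - p)^m, so the expected number of species is the sum over species of 1 - (1 - p)^m. The sketch below evaluates this expression; the function name `rarefaction_curve` and the four-species example are illustrative assumptions.

```python
import numpy as np

def rarefaction_curve(p, sample_sizes):
    """Expected number of species S_m for m draws with replacement."""
    p = np.asarray(p, dtype=float)
    m = np.asarray(sample_sizes, dtype=float)[:, None]
    # 1 - (1 - p)**m, written with log1p/expm1 to stay accurate for tiny p.
    return np.sum(-np.expm1(m * np.log1p(-p)), axis=1)

# Illustrative community: four equally abundant species.
even = np.full(4, 0.25)
curve = rarefaction_curve(even, [1, 2, 1000])
```

The curve starts at one species for a single draw and saturates at the species richness for very large samples.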



Hill diversities 



The Hill diversities, defined in Equation (3), can be computed if the community abundances
are known. If only sample data are available, Hill diversities have to be estimated. We
consider sampling with replacement, and denote by $M$ the sample size and by $F_k$ the number
of species sampled $k$ times. We developed an estimation procedure that exploits the link
between Hill diversities $D_\alpha$ and the rarefaction curve $S_m$. The lower estimate of the
rarefaction curve, defined piecewise with one branch for sample sizes up to $M$ and one
branch for $m > M$, yields the lower estimate of the Hill diversity,

[Equation (1): lower estimate $D_\alpha^-$]

where $\Gamma$ denotes the gamma function. Similarly, the upper estimate of the rarefaction curve,
again defined piecewise and with $N$ the (estimated) community size, yields the upper
estimate of the Hill diversity,

[Equation (2): upper estimate $D_\alpha^+$]

The estimators (1) and (2) can be computed with the Matlab code in the Supplementary
Information, and were used to generate Figures 4 and 5.



Results 



Species richness cannot be estimated from sample data alone 

We are interested in estimating the diversity of a community based on the composition of 
a sample taken from the community. Our approach is to reconstruct community structures, 
i.e., species abundance distributions, from the sample data. For the example data set of 
Figure 1, we find that a wide range of communities are consistent with the sample data. The 
reconstructed communities have vastly different numbers of species, differing by two orders 
of magnitude, implying that estimating species richness is subject to large biases. 

We claim that sample data are always consistent with very different community structures.
To establish this claim we study the link between the rare species tail of the community and
the sample data, summarized by the rarefaction curve. A computation in Supplementary
Text S1 shows that the rarefaction curve up to sample size $M$ is insensitive to the
abundance distribution of species with relative abundance well below $1/M$. For concreteness
we fix a relative abundance threshold set by the sample size $M$, and we call the species with
larger and smaller relative abundance than this threshold the "non-rare" and "rare" species,
respectively. The computation shows that the rarefaction curve does not depend on the
abundance distribution of the rare species. Changes in the rare species tail, such as
increasing the number of rare species by several orders of magnitude (but keeping the total
abundance of rare species constant), do not affect the sample data. As a consequence,
estimating species richness is intrinsically problematic.
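
The insensitivity to the rare tail can be checked numerically. The sketch below is an illustration with made-up numbers, not the construction of Figure 1: the helper `community`, the 50 non-rare species, the 0.5% rare fraction and the sample size M = 1000 are all assumptions. Two communities share the same non-rare part and the same total rare abundance, but one has 100 times more rare species.

```python
import numpy as np

def rarefaction_curve(p, sample_sizes):
    """Expected number of species S_m for m draws with replacement."""
    p = np.asarray(p, dtype=float)
    m = np.asarray(sample_sizes, dtype=float)[:, None]
    return np.sum(-np.expm1(m * np.log1p(-p)), axis=1)

def community(n_rare, n_common=50, rare_fraction=0.005):
    """Hypothetical community: fixed non-rare part plus n_rare equally rare species."""
    common = 1.0 / np.arange(1, n_common + 1)
    common *= (1.0 - rare_fraction) / common.sum()
    rare = np.full(n_rare, rare_fraction / n_rare)
    return np.concatenate([common, rare])

M = 1000
sizes = [1, 10, 100, M]
small_tail = community(n_rare=10**3)   # 1,050 species in total
large_tail = community(n_rare=10**5)   # 100,050 species in total

curve_small = rarefaction_curve(small_tail, sizes)
curve_large = rarefaction_curve(large_tail, sizes)
# Largest difference between the two rarefaction curves up to sample size M.
max_gap = float(np.max(np.abs(curve_small - curve_large)))
```

In this example the two curves differ by a small fraction of one species up to sample size M, although the richness of the two communities differs by about two orders of magnitude.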

Note that we use a statistical definition of rarity which depends on the sampling effort $M$: the
set of rare species gets smaller when sampling gets deeper. This contrasts with the ecological
concept of rarity, a community property independent of sample size (Pedrós-Alió 2006; Sogin
et al. 2006), see the Discussion section.



To further illustrate the theoretical result we reconsider the reconstructed communities of
Figure 1. The communities have the same abundance distribution of the non-rare species.
In each community the set of rare species occupies 0.5% of the total community abundance,
explaining why the corresponding rarefaction curves coincide, see Figure 1D. Nevertheless,
the number of rare species differs by two orders of magnitude. Another example of in silico
communities with very different rare species tails but with the same rarefaction curve is
shown in Supplementary Figure S1.

Figure 1: Empirical sample data are consistent with very different communities. We consider
the abundance data of a sample taken from a bacterial soil community (sample "Brazil" in
Roesch et al. (2007)). The sample consists of 26079 individuals belonging to 2880 species.
We tried to reconstruct the community from which the sample was taken. Panels a-c show
the rank-abundance curves of three such reconstructed communities (panel a, in red; panel b,
in blue; panel c, in green), whose species richness differs by two orders of magnitude. For
each of the three reconstructions the community rank-abundance curve is an extension of the
sample rank-abundance curve (in black). We claim that each of the three reconstructed
communities is compatible with the sample data. This can be seen from the rarefaction curves
in panel d: the rarefaction curve for the sample data (black line) coincides with the
rarefaction curves for the reconstructed communities (red line with squares for community in
panel a, blue line with x-marks for community in panel b, and green line with diamonds for
community in panel c). Because the sample data are consistent with very different values of
the community richness, the community richness cannot be estimated from the sample data.

Figure 2: Estimated species richness does not rank communities correctly. We generated
three community abundance distributions, the rank-abundance curves of which are shown in
panel a. Community C1 (red) has the smallest number of species; community C3 (green)
has the largest number of species. The initial parts of the rarefaction curves of the three
communities are shown in panel b. Based on the rarefaction data, one would conclude
that community C1 is the most diverse and community C3 the least diverse. Hence, the
ranking of the communities according to their observed diversity is inverted compared to the
ranking according to their true diversity. This observation is confirmed when applying Chao's
estimator to sample data. Community C1 is estimated to have 10 times more species than
community C3, whereas in reality community C1 has 20 times fewer species than community
C3. See Supplementary Table S1 for the numerical data of the communities.

We conclude that sample data do not allow us to distinguish communities with very different 
rare species tails. The insensitivity of the rarefaction curve to rare species implies that it is 
difficult or impossible to reliably estimate the community species richness from sample data 
alone. 

Relative species richness cannot be estimated from sample data alone 

We have shown that the number of species in a community cannot be reliably estimated 
from sample data. A related question is whether sample data can be used to rank different 
communities according to their number of species. In this section we show that this cannot 
be done without additional assumptions. 

We present an explicit example to illustrate the use of sample data to rank communities, see
Figure 2. We consider three communities which differ widely in species richness: community
C1 has 20 times fewer species than community C3. We construct the initial parts of their
rarefaction curves, see Figure 2B. Surprisingly, the rarefaction curves suggest that community
C1 is the most diverse, and community C3 the least diverse. We therefore expect that any
estimator of species richness ranks the communities in the inverse order of their true species
richness. Indeed, Chao's estimator predicts that community C1 has almost 10 times as many
species as community C3 (see Supplementary Table S1; values are averaged over sample
randomness).
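
For readers who want to reproduce this kind of comparison, the sketch below computes Chao's estimator in its widely used form, observed richness plus F_1^2 / (2 F_2), where F_k is the number of species sampled k times. The function names are hypothetical, and the fallback used when no species is observed exactly twice is the common bias-corrected variant, which is an assumption and not necessarily the convention followed for Supplementary Table S1.

```python
from collections import Counter

def frequency_counts(sample_counts):
    """Map k -> F_k, the number of species observed exactly k times."""
    return Counter(c for c in sample_counts if c > 0)

def chao1(sample_counts):
    """Chao's lower estimate of species richness from per-species counts."""
    f = frequency_counts(sample_counts)
    s_obs = sum(f.values())
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if f2 == 0:
        # Assumed fallback: bias-corrected form, used when there are no doubletons.
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)

# Illustrative sample: 8 observed species, 4 singletons, 2 doubletons.
estimate = chao1([5, 3, 1, 1, 1, 1, 2, 2])
```

The estimate depends on the sample only through the observed richness and the numbers of singletons and doubletons, which is why it carries no information about species far rarer than those observed.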

To understand the incorrect ranking we take a closer look at the communities in Figure 2A.
We explained, in the previous section, that sample data are insensitive to rare species. When
we compare the number of non-rare species in the communities (species with relative
abundance above the rarity threshold), we find that community C1 has 15 times more non-rare
species than community C3. This explains why the sample data suggest that community C1 is
the most diverse. Community C1 has a large number of non-rare species combined with a
relatively small number of rare species. In contrast, community C3 has a relatively small
number of non-rare species combined with a very large number of rare species. This explains
the discrepancy between the true number of species, mainly determined by the rare species,
and the estimated number of species, determined by the non-rare species.

The example of Figure 2 indicates a general problem: relative species richness cannot be 
reliably estimated. The problem is due to the same mechanism as the one identified in the 
previous section. Sample data cannot be used to rank communities according to their number 
of species because sample data do not contain information about the number of rare species. 



Some generalized diversities can be estimated from sample data alone 



Although insensitive to rare species, sample data do contain information about the community
structure. In this section we demonstrate that diversity indices that are weakly dependent
on rare species can be estimated from sample data.

Diversity is a broader notion than species richness. Alternative definitions of diversity have
been proposed in which rare species contribute less than common species. These alternative
diversities account not only for species richness but also for the evenness of the community
structure. Examples are the Shannon diversity index (Shannon 1948) and the Simpson
diversity index (Simpson 1949). Here we study a family of generalized diversities, the Hill
diversities $D_\alpha$ (Hill 1973), that includes these two examples as well as species richness as
special cases. For a community consisting of $S$ species with relative abundances
$p_1, p_2, \ldots, p_S$, the Hill diversities are defined by

$$D_\alpha = \Bigl( \sum_{i=1}^{S} p_i^{\alpha} \Bigr)^{\frac{1}{1-\alpha}}. \qquad (3)$$

We obtain a Hill diversity for each value of the parameter $\alpha$. For $\alpha = 0$ the species are
weighted equally in the sum of Equation (3) (each term is equal to one), and $D_0 = S$, i.e., $D_0$
is equal to species richness. For $\alpha > 0$ the species are not weighted equally. Instead, a rare
species contributes less than a common species. For larger values of $\alpha$ the weighting is more
unequal, see Supplementary Text S2. As an extreme case, only the most abundant species
contributes in the limit $\alpha \to \infty$. The Hill diversity of order 1 is related to the Shannon
diversity index (note that Definition (3) should be understood as
$D_1 = \lim_{\alpha \to 1} D_\alpha$) and the Hill diversity of order 2 is related to the Simpson
concentration index. The Hill diversity for a community in which all $S$ species have the same
relative abundance $p_i = 1/S$ is equal to $D_\alpha = S$ for any value of the parameter $\alpha$. This
indicates that any Hill diversity $D_\alpha$ can be considered as an effective number of species
(Hill 1973; Jost 2006), which facilitates the interpretation of estimated diversity values and
allows us to compare the estimation properties of different Hill diversities.
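
The definition can be made concrete with a short computation. The sketch below assumes the standard form of the Hill diversity, the sum of the relative abundances raised to the power of the order, all raised to 1/(1 - order), with the order-1 case taken as the exponential of the Shannon entropy; the function name `hill_diversity` and the example abundances are illustrative.

```python
import numpy as np

def hill_diversity(p, alpha):
    """Hill diversity of order alpha for relative abundances p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # absent species do not contribute
    if np.isclose(alpha, 1.0):
        # Limit alpha -> 1: exponential of the Shannon entropy.
        return float(np.exp(-np.sum(p * np.log(p))))
    return float(np.sum(p ** alpha) ** (1.0 / (1.0 - alpha)))

uneven = [0.5, 0.25, 0.25]
richness = hill_diversity(uneven, 0)   # number of species
shannon = hill_diversity(uneven, 1)    # exp(Shannon entropy)
simpson = hill_diversity(uneven, 2)    # 1 / Simpson concentration
even = hill_diversity(np.full(8, 1 / 8), 1.5)
```

For the uneven community the three values decrease with increasing order, reflecting the growing weight of the dominant species; for the even community every order returns the number of species.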

Figure 3: Extrapolating the rarefaction curve. The Hill diversity estimators $D_\alpha^-$ and
$D_\alpha^+$ are based on reconstructions of the rarefaction curve $S_m$ from sample data. For a
sample of size $M$, the rarefaction curve $S_m$ for $m < M$ can be estimated by subsampling (red
full line). If the sample size $M$ is large, the estimator has small uncertainty. The rarefaction
curve $S_m$ for $m > M$ can be estimated by extrapolating the sample data beyond the sample
size $M$. Different extrapolation scenarios are compatible with the sample data. We consider
two extreme scenarios (red dashed lines). A lower estimate is obtained by assuming that
unobserved species are approximately as rare as the rarest observed species. An upper
estimate is obtained by assuming that unobserved species are represented in the community
by one individual. The difference between the two extremes quantifies the uncertainty of
the extrapolation, shown as the red shaded region. The uncertainty increases rapidly for
$m > M$.

As $\alpha$ increases the Hill diversities are increasingly insensitive to the tail of rare species and
are more strongly determined by the non-rare species, see Supplementary Figure S2. Hence,
we expect that they are more accurately estimated from sample data. A mathematical link
between the Hill diversities and the rarefaction curve further indicates which Hill diversities
can be estimated from sample data. In Supplementary Text S3 we show that any Hill diversity
$D_\alpha$ can be expressed in terms of the rarefaction curve. The Hill diversity $D_2$ is related to
the initial slope of the rarefaction curve (Lande et al. 2000). Thus, for $\alpha$ close to 2, the Hill
diversity $D_\alpha$ depends on the part of the rarefaction curve for small sample size. For smaller
$\alpha$, the Hill diversity $D_\alpha$ depends on the rarefaction curve for increasingly large sample size.
The Hill diversity $D_0$ is equal to species richness, which can be obtained as the limit of the
rarefaction curve for infinite sample size.
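
One way to see this link numerically is sketched below. It is not taken from Supplementary Text S3; it uses an identity derived here from the binomial series, under the assumption of sampling with replacement: the sum of the abundances raised to the power of the order equals a weighted sum of the rarefaction increments S_m - S_(m-1), with weight 1 for m = 1 and each subsequent weight obtained by multiplying by (m - order) / m. For order 2 only the first two increments contribute, which recovers the relation between Simpson diversity and the initial slope; for order 0 all weights equal one and the sum is the limit of the rarefaction curve.

```python
import numpy as np

def rarefaction(p, m):
    """Expected number of species in m draws with replacement."""
    return float(np.sum(1.0 - (1.0 - p) ** m))

def hill_from_rarefaction(p, alpha, n_terms=2000):
    """Hill diversity (alpha != 1) from a truncated series of rarefaction increments."""
    total, weight, previous = 0.0, 1.0, 0.0    # weight for m = 1, and S_0 = 0
    for m in range(1, n_terms + 1):
        s_m = rarefaction(p, m)
        total += weight * (s_m - previous)
        previous = s_m
        weight *= (m - alpha) / m              # weight for m + 1
    return total ** (1.0 / (1.0 - alpha))

p = np.array([0.5, 0.3, 0.2])                  # illustrative community
direct = {a: float(np.sum(p ** a) ** (1.0 / (1.0 - a))) for a in (0.0, 0.5, 1.5, 2.0)}
series = {a: hill_from_rarefaction(p, a) for a in direct}
```

The number of terms needed grows as the rarest species becomes rarer and as the order decreases, which mirrors the statement that low-order diversities depend on the rarefaction curve at very large sample sizes.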

These observations have important implications for the diversity estimation problem. We
suppose that sample data of size $M$ are given, and we try to estimate the rarefaction curve
at sample size $m$. The community rarefaction curve for sample sizes $m < M$ can be estimated
in an unbiased manner by subsampling the sample data, but for $m > M$ the rarefaction curve
can only be estimated based on extrapolation. This leads to increasingly biased estimates
as $m$ increases. Hence, we reach the following conclusions. On the one hand, Hill diversities
that depend on the initial part of the rarefaction curve, that is, $D_\alpha$ for $\alpha$ close to 2, can
be estimated robustly. On the other hand, Hill diversities that depend on the part of the
rarefaction curve for large sample size, that is, $D_\alpha$ for $\alpha$ close to 0, cannot be estimated
robustly. We now seek to make this classification of community diversities more precise.



Estimators for Hill diversities 



We have argued that the Hill diversities $D_\alpha$ with $\alpha$ close to 2 can be estimated accurately,
and that the Hill diversities $D_\alpha$ with $\alpha$ close to 0 cannot be estimated accurately. In this
section we introduce and study estimators for the set of Hill diversities $D_\alpha$ with
$0 \le \alpha \le 2$.

Figure 4: Estimated Hill diversities for in silico communities. We generated samples from
a community with power-law abundance distribution (exponent $z = 2$) and evaluated the
estimators $D_\alpha^+$ and $D_\alpha^-$ for the Hill diversity $D_\alpha$. We consider three sample sizes $M$
(in columns) and three community sizes $N$ (in rows). The shaded range between $D_\alpha^+$ and
$D_\alpha^-$ indicates the estimation uncertainty. The true Hill diversity of the community is plotted
in black. The Hill diversities between $\alpha = 1$ (Shannon) and $\alpha = 2$ (Simpson) are correctly
estimated even for small sample size $M$. The estimates of Hill diversities of order less than
$\alpha = 1$, including $\alpha = 0$ (species richness), are characterized by large uncertainty.

We have shown that a wide variety of communities may be consistent with any given sample
data. Here we look for two extreme members of this set of reconstructed communities. We
construct a lower estimate of the diversity, $D_\alpha^-$, by assuming that unobserved species are
approximately as rare as the rarest observed species. We construct an upper estimate of the
diversity, $D_\alpha^+$, by assuming that unobserved species are represented in the community by a
single individual. We first extrapolate the rarefaction curve based on these assumptions, see
Figure 3, and then use the extrapolated curves to calculate the Hill diversities. The detailed
construction of the estimators $D_\alpha^-$ and $D_\alpha^+$ is presented in Supplementary Texts S3, S4 and
S5. A summary of the estimator formulas can be found in the Materials and Methods section.
We provide Matlab code to compute the estimators in the Supplementary Information.

Two properties follow directly from the definition of the estimators $D_\alpha^-$ and $D_\alpha^+$, see
Supplementary Text S5. First, the lower estimate $D_0^-$ for species richness is equal to Chao's
estimator. Hence, the lower estimate $D_\alpha^-$ generalizes Chao's estimator for Hill diversities
$D_\alpha$ with $\alpha > 0$. Second, the estimators for Simpson diversity $D_2$ coincide, $D_2^- = D_2^+$. This
corresponds to the existence of an unbiased, non-parametric estimator for the Simpson
concentration index, and confirms that Simpson diversity $D_2$ is particularly easy to estimate,
even for small sample size $M$. Note that the lower estimate can be computed from the sample
data alone, but the upper estimate also requires an estimate of the community size $N$.
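
The unbiased estimator of the Simpson concentration index mentioned above has a simple classical form under sampling with replacement: the sum over species of n(n - 1), divided by M(M - 1), where n is the count of a species in the sample and M the sample size. The sketch below implements this textbook estimator; whether it coincides term by term with the authors' Matlab code is an assumption, and the function names are hypothetical.

```python
import numpy as np

def simpson_concentration_unbiased(sample_counts):
    """Unbiased estimate of sum(p_i**2) from per-species sample counts."""
    n = np.asarray(sample_counts, dtype=float)
    m = n.sum()
    return float(np.sum(n * (n - 1.0)) / (m * (m - 1.0)))

def simpson_diversity_estimate(sample_counts):
    """Estimated Hill diversity of order 2 (undefined if every species is a singleton)."""
    return 1.0 / simpson_concentration_unbiased(sample_counts)

# Illustrative check of unbiasedness by simulation for a known community.
p = np.array([0.5, 0.3, 0.2])
rng = np.random.default_rng(0)
samples = rng.multinomial(50, p, size=20_000)
mean_estimate = float(np.mean([simpson_concentration_unbiased(s) for s in samples]))
```

Averaged over many samples of only 50 individuals, the estimate stays close to the true concentration of 0.38, which illustrates why Simpson diversity can be estimated even from small samples.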
In Figure 4 we apply the estimators $D_\alpha^-$ and $D_\alpha^+$ to sample data from an in silico
community. For $\alpha > 1$ the lower and upper estimates almost coincide, so that the Hill
diversities $D_\alpha$ with $\alpha > 1$, and in particular Simpson diversity $D_2$, may be estimated with
small error. This holds for any sample size $M$ (as small as $M = 100$) and any community size
$N$. For $\alpha < 1$ the upper estimate increases steeply, so that the estimation uncertainty of the
Hill diversities with $\alpha$ small, and in particular species richness $D_0$, is very large. This holds
for any sample size $M$ (even for the largest sample size considered) and any community size
$N$ much greater than $M$. The effect of sample size $M$ and community size $N$ is only pertinent
for $\alpha$ close to 1. For these values of $\alpha$ the range between the lower and upper estimates
narrows with increasing sample size $M$ and decreasing community size $N$, so that increasingly
accurate estimates are obtained for Shannon diversity $D_1$.



We observe the same behavior when applying the Hill diversity estimators to empirical sample
data, see Figure 5. We applied the estimators to nine metagenomic data sets from a wide
range of environments: soil samples at four locations (Roesch et al. 2007), one of which is
studied in Figure 1, a seawater sample from the upper ocean (Rusch et al. 2007) and seawater
samples at two deep-sea vent locations (Huber et al. 2007). The Hill diversities $D_\alpha$ for
$\alpha > 1$, including Shannon and Simpson diversity, can be estimated reliably. For small $\alpha$ the
estimation uncertainty is very large, that is, Hill diversities close to species richness cannot
be estimated reliably. The dependence of the estimation accuracy on the (estimated)
community size $N$ is weak, see Supplementary Figure S4. These observations show that our
analysis for in silico communities is relevant for real communities as well.



Discussion 



We have argued that the estimation of species richness is intrinsically problematic. We 
have provided evidence in three different but related ways. First, we have shown that it 
is possible to add a large number of rare species to the community without significantly 
affecting its statistical properties under fixed-size sampling, see Figure 1. As the number of 
added rare species can be large, the estimation uncertainty of the number of species is large 
as well. Second, we have discussed an exact relationship between the community rarefaction 
curve and the set of Hill diversities. Hill diversities close to Simpson's are based on the 
initial part of the rarefaction curve, which can be reliably interpolated from sample data. 
Hill diversities beyond Shannon's, and species richness in particular, depend on parts of 
the rarefaction curve orders of magnitude beyond the actual sample size, whose estimation 
requires unverifiable extrapolation. Third, we have constructed two estimators related to the 
Hill diversities, delimiting the range in which each true Hill diversity is expected to lie. This 
range is relatively narrow for diversities from Simpson's to Shannon's, but it diverges for 
diversities towards species richness, see Figures 4 and 5. Hence, the estimation uncertainty 
of species richness is intrinsically large. 

We have also studied a weaker form of species richness estimation, namely, whether
communities can be ranked according to species richness based on sample data. We have argued
that in this case, too, the sample data are not sufficiently informative. The example shown in
Figure 2 is interesting, because the community ranking based on estimated species richness,
although completely different from the ranking based on true richness, is the same as the
ranking based on true Simpson or Shannon diversity, see Supplementary Table S1. This
observation can be understood intuitively. The insensitivity of the species richness estimator
to the very rare species in the community is shared by the Simpson and Shannon diversity, but
not by the community species richness. In fact, different diversity estimators often yield the
same community ranking (Shaw et al. 2008). This should not be interpreted as an indication
of the validity of the ranking for species richness; the ranking based on true species richness
can be completely different. Communities should only be ranked according to community
properties that can be estimated reliably.

Figure 5: Estimated Hill diversities for natural microbial communities. We observe the same
behavior as for the in silico generated data sets of Figure 4: for $\alpha > 1$ the Hill diversity
$D_\alpha$ can be estimated accurately; for $\alpha < 1$ the estimation of the Hill diversity $D_\alpha$ has large
uncertainty. We used the same data sets as Quince et al. (2008): a seawater bacterial sample
from the upper ocean (Rusch et al. 2007), soil bacterial samples at four locations: Brazil,
Florida, Illinois and Canada (Roesch et al. 2007), and seawater samples from deep-sea vents
at two locations: FS312 and FS396, separated into bacteria and archaea (Huber et al. 2007).
The community size $N$ was fixed at a single value for illustration; results are robust to
changes in community size (see Supplementary Figure S4).



The intrinsic problem of species richness estimation can be overcome by introducing more
information in the estimation procedure. Obviously, the reliability of the estimate crucially
depends on the reliability of the additional information. For example, assuming a family of
abundance distributions (for example, lognormal) can lead to species richness estimates with
small uncertainty (Schloss and Handelsman 2005; Hong et al. 2006; Quince et al. 2008). But
both the estimate and the uncertainty are conditional on the assumed distribution family.
In particular, assuming a species abundance distribution also fixes the rare species tail and,
as we have argued, the sample data contain little information about the rare species tail.
Hence, the choice of distribution family is arbitrary. Still, this choice strongly affects the
species richness estimate. We believe this to be a serious problem for this approach to
diversity estimation.

Other assumptions have been introduced to make diversity estimation manageable. Some 
regularity has been observed in the distribution of diversity over coarse taxonomic groups 
(Mora et al. 2011). This regularity can be assumed down to the species level to guide the 
estimation of species richness. Clearly, the approach depends crucially on the unverifiable 
validity of the extrapolation. More generally, this and other approaches attempt to reduce 
the wide range of diversity values consistent with the data to a single value. This implies 
that the reduction step is based on detailed information not contained in the sample data. 
Such an approach is necessarily very sensitive to the detailed assumptions, and therefore not 
robust. 



Mao and Colwell (2005) pointed out that rare species pose a serious problem for estimating
species richness. In this paper we have shown a practical way forward by quantifying the
range of diversity values consistent with the data. The latter idea underlies our construction 
of lower and upper estimates of community diversity, and is also crucial for Chao's estimator 
of species richness (Chao 1984). This estimator does not attempt to directly assess true 
species richness, but rather approximates the lowest species richness consistent with the 
sample data. In many practical cases this indirect estimation is the most informative claim 
that can be made about species richness. 



Different studies have highlighted the role of rare species in microbial communities (Dykhuizen
1998; Pedrós-Alió 2006; Sogin et al. 2006; Pedrós-Alió 2007; Huber et al. 2007; Gobet et al.
2010). We have argued that sample data contain limited information about the rare species
tail of the community. For example, the total number of rare species cannot be estimated.
However, an estimator for the relative abundance of unobserved species is available, see Sup- 
plementary Text S4. For the data sets we have analyzed the estimated relative abundance 
ranges from 0.1% to 5%, see Supplementary Table S2. These estimates depend on sample 
size. It might be more practical to use a notion of rarity that is independent of sample size. 
For example, we could call a species rare if its community abundance is below a certain 
threshold value (for example, relative abundance below 10~^). We plan to address the prob- 
lem of estimating the relative abundance of rare species in a sample-independent fashion as 
part of future work. 

In this paper we have only considered taxonomic diversity. Other notions of diversity such
as functional and phylogenetic diversity are becoming increasingly popular (Horner-Devine
and Bohannan 2006; Lozupone and Knight 2007; Green et al. 2008). Our study suggests
that any diversity metric that depends strongly on rare species will be difficult or impossible
to estimate robustly. It is interesting to note that other measurement techniques for microbial
diversity are confronted with limitations similar to those of the sample-based techniques
discussed in this paper. The reassociation kinetics of community DNA are affected by community
diversity (Torsvik et al. 1990; Gans et al. 2005), but it has been argued that Simpson and
Shannon diversity, rather than species richness, can be estimated from the data (Haegeman
et al. 2008). Fingerprinting techniques provide snapshots of the community structure
(Fromin et al. 2002): in this context also, the estimation of species richness seems to be
impossible for highly diverse communities (Loisel et al. 2006; Bent and Forney 2008), but
preliminary results indicate that accurate estimators can be constructed for Simpson diversity.
The total number of genes in a species, that is, the pan-genome size, has been
estimated from a small number of sample genomes (Tettelin et al. 2005), but it has been
argued that these estimates are not robust and that similarity-based metrics should be used
instead (Kislyuk et al. 2011).



These findings, together with those of this paper, make a strong case for the versatility of
generalized diversities in the analysis of microbial diversity. They can be interpreted
as effective numbers of species that give greater weight to common species (Hill 1973; Jost 2006),
and have superior estimation properties compared to species richness. We recommend the use
of Shannon and Simpson diversity to quantify and compare microbial taxonomic diversity.

Supplementary Text

Text S1

Contribution of rare species to rarefaction curve 



We define $S_m$ as the expected number of species in a sample of $m$ individuals taken from the
community. The rarefaction curve of the community is the plot of the number of species $S_m$
as a function of the sample size $m$. We consider a community consisting of $S$ species with
relative abundances $p_1, p_2, \ldots, p_S$. Then the expected number of sampled species $S_m$ is given
by

$$S_m = \sum_{i=1}^{S} \left(1 - (1 - p_i)^m\right). \tag{S1}$$

It is important to distinguish the community rarefaction curve (S1) from the rarefaction curve
estimated from sample data. We consider a sample of size $M$ taken from the community. We
denote the number of species observed in the sample by $S_{\mathrm{obs}}$, and the number of species with
abundance $k$ in the sample by $F_k$. For $m \le M$ the rarefaction curve can be estimated
by taking subsamples of size $m$ out of the sample. The average number of species observed
in the subsample (averaged over all subsamples of size $m$) is an estimator for $S_m$,

$$\hat{S}_m = \sum_{k \ge 1} F_k \left(1 - \frac{\binom{M-k}{m}}{\binom{M}{m}}\right). \tag{S2}$$

This estimator is reliable in the sense that it is unbiased (that is, the expected value of $\hat{S}_m$
is equal to $S_m$). Moreover, there is no other unbiased estimator with smaller variance. For
$m > M$ the estimation of the rarefaction curve is necessarily based on extrapolation, leading
to less reliable estimates, especially for $m \gg M$.
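
As an illustration of the subsampling estimator (S2), the following Python sketch computes the estimated rarefaction curve from the frequency counts $F_k$. The function name, the dictionary layout (k mapped to $F_k$) and the toy counts are assumptions made for this example only; they are not taken from the data sets or the Matlab code of the paper.

```python
from math import comb


def rarefaction_estimate(freq_counts, m):
    """Estimate S_m for 1 <= m <= M following (S2).

    freq_counts maps k -> F_k, the number of species observed k times.
    """
    M = sum(k * f for k, f in freq_counts.items())  # sample size
    if not 1 <= m <= M:
        raise ValueError("m must satisfy 1 <= m <= M")
    total = comb(M, m)
    # comb(M - k, m) is 0 whenever m > M - k, so such species count fully.
    return sum(f * (1 - comb(M - k, m) / total) for k, f in freq_counts.items())


# Hypothetical toy sample: 3 singletons, 2 doubletons, 1 species seen 5 times.
toy = {1: 3, 2: 2, 5: 1}
curve = [rarefaction_estimate(toy, m) for m in range(1, 13)]
```

At $m = 1$ the estimate equals 1 (a single individual is always one species), and at $m = M$ it equals the observed richness $S_{\mathrm{obs}}$, which gives two quick sanity checks on any implementation.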

We define a species to be rare if its relative abundance is much smaller than $1/M$. This means
that a rare species is unlikely to be present in the sample (of size $M$). For concreteness we
say that

$$\text{species } i \text{ is rare if } p_i < \frac{1}{50\,M}. \tag{S3}$$

Note that our definition of rarity depends on the sample size $M$. The choice of a threshold
for rarity is arbitrary, though our results are robust to changes in the constant (which in this
case has been set to 50) so long as it is much greater than 1.

We consider the rarefaction curve (S1) up to sample size $M$. The contribution of species $i$
can be written as

$$1 - (1 - p_i)^m = \sum_{j=1}^{m} \binom{m}{j}\, p_i^j\, (1 - p_i)^{m-j}.$$

The $j$-th term in this sum is the probability that species $i$ is represented $j$ times in a sample
of size $m$. For a rare species $i$ we have $p_i \ll \frac{1}{M} \le \frac{1}{m}$, and the first term dominates the other
terms. Hence,

$$1 - (1 - p_i)^m \approx m\, p_i\, (1 - p_i)^{m-1} \approx m\, p_i, \qquad m \le M.$$

Partitioning the set of species into rare and non-rare species, we get

$$S_m \approx \sum_{i\ \text{non-rare}} \left(1 - (1 - p_i)^m\right) + \sum_{i\ \text{rare}} m\, p_i
= \sum_{i\ \text{non-rare}} \left(1 - (1 - p_i)^m\right) + m\, P_{\mathrm{rare}}, \qquad m \le M, \tag{S4}$$

with $P_{\mathrm{rare}}$ the total relative abundance of the set of rare species in the community.



From Equation (S4) it follows that the rarefaction curve does not depend on the abundance
distribution of the rare species, but only on the total abundance of the rare species. This
follows directly from Definition (S3): it is unlikely that a rare species will be observed twice
in a sample of size $m$ (when $m \le M$). Therefore, the contribution of the rare species to
the sample species richness depends only on their prevalence in the sample which, in turn,
depends only on their prevalence in the community. In particular, rarefaction curves obtained
for different abundance distributions of the rare species are indistinguishable, see Figure S1.
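
The statement can be checked numerically with a short Python sketch. It evaluates the community rarefaction curve (S1) for two hypothetical communities that share the same non-rare species but differ a hundredfold in the number of rare species. All numbers (sample size, abundances, species counts) are assumptions chosen for illustration and do not correspond to the communities of Figure S1.

```python
def expected_species(groups, m):
    """Community rarefaction curve (S1) for groups of (abundance, n_species)."""
    return sum(n * (1.0 - (1.0 - p) ** m) for p, n in groups)


M = 1000                      # hypothetical sample size
common = (0.09, 10)           # 10 non-rare species holding 90% of the abundance
p_rare_total = 0.1            # total relative abundance of the rare tail

# Two rare tails with the same total abundance, both below 1 / (50 * M) = 2e-5.
community_a = [common, (p_rare_total / 10**4, 10**4)]   # 10^4 rare species
community_b = [common, (p_rare_total / 10**6, 10**6)]   # 10^6 rare species

s_a = expected_species(community_a, M)
s_b = expected_species(community_b, M)
approx = 10 + M * p_rare_total   # (S4): non-rare species all seen, plus m * P_rare
```

Under these assumptions both curves stay within about half a percent of each other at $m = M$, and both are close to the approximation (S4), although the two communities differ by almost a factor of 100 in species richness.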

Text S2



Contribution of rare species to Hill diversities 

In the main text we have introduced the Hill diversities $D_\alpha$,

$$D_\alpha = \left(\sum_{i=1}^{S} p_i^\alpha\right)^{\frac{1}{1-\alpha}}. \tag{S5}$$

The Hill diversity of order 1 is defined as the limit $D_1 = \lim_{\alpha \to 1} D_\alpha$, and is related to the
Shannon diversity index $H$,

$$D_1 = e^{H} \quad \text{with} \quad H = -\sum_{i=1}^{S} p_i \ln p_i. \tag{S6}$$

The Hill diversity of order 2 is related to the Simpson concentration index $C$,

$$D_2 = \frac{1}{C} \quad \text{with} \quad C = \sum_{i=1}^{S} p_i^2.$$

The Hill diversity of order $\infty$ is related to the relative abundance $p_{\max}$ of the most abundant
species,

$$D_\infty = \frac{1}{p_{\max}} \quad \text{with} \quad p_{\max} = \max\{p_1, p_2, \ldots, p_S\}.$$

We consider a community in which the rare species occupy a fraction $P_{\mathrm{rare}}$ of the total
community abundance. We study the dependence of the Hill diversity on the number of rare
species $S_{\mathrm{rare}}$. Assuming that the rare species have equal abundance, we get

$$D_\alpha = \left(\sum_{i\ \text{non-rare}} p_i^\alpha + \sum_{i\ \text{rare}} p_i^\alpha\right)^{\frac{1}{1-\alpha}}
= \left(\sum_{i\ \text{non-rare}} p_i^\alpha + S_{\mathrm{rare}} \left(\frac{P_{\mathrm{rare}}}{S_{\mathrm{rare}}}\right)^{\alpha}\right)^{\frac{1}{1-\alpha}}
= \left(\sum_{i\ \text{non-rare}} p_i^\alpha + P_{\mathrm{rare}}^{\alpha}\, S_{\mathrm{rare}}^{1-\alpha}\right)^{\frac{1}{1-\alpha}}. \tag{S7}$$

The first term inside the brackets contains the contribution of the non-rare species. The
second term inside the brackets, $P_{\mathrm{rare}}^{\alpha} S_{\mathrm{rare}}^{1-\alpha}$, contains the contribution of the rare species.
The contribution of the non-rare species is independent of $S_{\mathrm{rare}}$. For $\alpha > 1$ the contribution
of the rare species decreases with $S_{\mathrm{rare}}$ and vanishes for $S_{\mathrm{rare}} \to \infty$. Hence, the rare species
contribute only weakly to the Hill diversity $D_\alpha$ for $\alpha > 1$. For $\alpha < 1$ the contribution of
the rare species increases with $S_{\mathrm{rare}}$ and diverges for $S_{\mathrm{rare}} \to \infty$. Hence, for sufficiently large
$S_{\mathrm{rare}}$ the rare species contribution dominates the Hill diversity $D_\alpha$ for $\alpha < 1$. Note that the
relative contribution of the rare to the non-rare species has a power-law dependence on $S_{\mathrm{rare}}$
with exponent $1 - \alpha$. For the Hill diversity $D_1$ the relative contribution of the rare to the
non-rare species has a logarithmic dependence on $S_{\mathrm{rare}}$, see (S6).
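
The dependence on the number of rare species can be made concrete with a small Python sketch of Definition (S5). The community used here (99 equally abundant common species plus a tail of equally abundant rare species holding 1% of the abundance) is a hypothetical example, and the function names are illustrative assumptions.

```python
from math import exp, log


def hill_diversity(groups, alpha):
    """Hill diversity D_alpha (S5) for groups of (abundance, n_species)."""
    if alpha == 1:
        # Limit alpha -> 1: exponential of the Shannon index (S6).
        shannon = -sum(n * p * log(p) for p, n in groups)
        return exp(shannon)
    power_sum = sum(n * p ** alpha for p, n in groups)
    return power_sum ** (1.0 / (1.0 - alpha))


def community(s_rare, p_rare=0.01):
    """Hypothetical community: 99 common species plus s_rare rare species."""
    return [((1.0 - p_rare) / 99, 99), (p_rare / s_rare, s_rare)]


small_tail = community(10**3)
large_tail = community(10**6)

d2_small, d2_large = hill_diversity(small_tail, 2), hill_diversity(large_tail, 2)
d_half_small, d_half_large = hill_diversity(small_tail, 0.5), hill_diversity(large_tail, 0.5)
```

In this example the Simpson diversity $D_2$ is nearly identical for the two tails, whereas $D_{1/2}$ changes by well over an order of magnitude, in line with the sign of the exponent $1 - \alpha$ in (S7).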

Text S3



Hill diversities and rarefaction curve 



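We follow Mao (2007) to establish a link between the rarefaction curve $S_m$ and the Hill
diversities $D_\alpha$. Rewriting the sum $\sum_i p_i^\alpha$, we get

$$\begin{aligned}
\sum_{i=1}^{S} p_i^\alpha
&= \sum_{i=1}^{S} \sum_{m=0}^{\infty} \frac{(-1)^m\, \Gamma(\alpha+1)}{m!\, \Gamma(\alpha-m+1)}\, (1 - p_i)^m \\
&= -\sum_{i=1}^{S} \sum_{m=0}^{\infty} \frac{(-1)^m\, \Gamma(\alpha+1)}{m!\, \Gamma(\alpha-m+1)} \left(1 - (1 - p_i)^m\right) \\
&= \sum_{m=1}^{\infty} \frac{(-1)^{m+1}\, \Gamma(\alpha+1)}{m!\, \Gamma(\alpha-m+1)}\, S_m \\
&= \sum_{m=1}^{\infty} \frac{\alpha\, \Gamma(m-\alpha)}{m!\, \Gamma(1-\alpha)}\, S_m,
\end{aligned}$$

where $\Gamma$ denotes the gamma function. The first equality is the binomial series of
$p_i^\alpha = (1 - (1 - p_i))^\alpha$; the second uses the fact that the binomial coefficients
$(-1)^m \Gamma(\alpha+1) / (m!\, \Gamma(\alpha-m+1))$ sum to zero over $m \ge 0$ for $\alpha > 0$. Hence,

$$D_\alpha = \left(\sum_{m=1}^{\infty} \frac{\alpha\, \Gamma(m-\alpha)}{m!\, \Gamma(1-\alpha)}\, S_m\right)^{\frac{1}{1-\alpha}}. \tag{S8}$$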
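
The series in (S8) can be verified numerically. The Python sketch below compares the direct sum $\sum_i p_i^\alpha$ with the truncated series over the rarefaction curve for a hypothetical three-species community. The coefficients are generated by the recursion $a_1 = \alpha$, $a_{m+1} = a_m (m - \alpha)/(m + 1)$, which follows from $\Gamma(x+1) = x\,\Gamma(x)$ and avoids evaluating large gamma functions. The function names, the community and the truncation length are assumptions for illustration.

```python
def rarefaction(p, m):
    """Community rarefaction curve (S1): expected species count in m draws."""
    return sum(1.0 - (1.0 - p_i) ** m for p_i in p)


def power_sum_via_rarefaction(p, alpha, n_terms=5000):
    """Truncated series sum_m alpha*Gamma(m-alpha) / (m!*Gamma(1-alpha)) * S_m."""
    coeff = alpha                       # coefficient for m = 1
    total = 0.0
    for m in range(1, n_terms + 1):
        total += coeff * rarefaction(p, m)
        coeff *= (m - alpha) / (m + 1)  # ratio of consecutive coefficients
    return total


p = [0.5, 0.3, 0.2]                     # hypothetical three-species community
alpha = 1.5
direct = sum(p_i ** alpha for p_i in p)
series = power_sum_via_rarefaction(p, alpha)
```

For $\alpha = 2$ the recursion terminates after two terms, reproducing $C = 2 - S_2$. For small $\alpha$ the truncated series converges slowly, because the coefficients decay only as $m^{-(\alpha+1)}$, so a simple truncation like this one is only practical for $\alpha$ well above zero.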



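We express the link with the rarefaction curve in terms of the Tsallis entropies $T_\alpha$ (Tsallis
1988),

$$T_\alpha = \frac{1}{1-\alpha} \left(\sum_{i=1}^{S} p_i^\alpha - 1\right),$$

which are closely related to the Hill diversities $D_\alpha$,

$$D_\alpha = \left(1 + (1 - \alpha)\, T_\alpha\right)^{\frac{1}{1-\alpha}}. \tag{S9}$$

Equation (S8) becomes

$$T_\alpha = \frac{1}{1-\alpha} \left(-1 + \sum_{m=1}^{\infty} \frac{\alpha\, \Gamma(m-\alpha)}{m!\, \Gamma(1-\alpha)}\, S_m\right)
= -1 + \sum_{m=2}^{\infty} \frac{\alpha\, \Gamma(m-\alpha)}{m!\, \Gamma(2-\alpha)}\, S_m.$$

We study the behavior of the coefficients in this infinite sum,

$$c_m = \frac{\alpha\, \Gamma(m-\alpha)}{m!\, \Gamma(2-\alpha)}.$$

For $\alpha \in (0, 2)$ all coefficients $c_m$ are positive, and

$$c_m \sim \frac{\alpha}{\Gamma(2-\alpha)}\, m^{-(\alpha+1)} \qquad \text{as } m \to \infty. \tag{S10}$$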
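
A quick numerical check of the asymptotic behavior (S10) can be sketched in Python as follows. The coefficient is evaluated through log-gamma functions to avoid overflow; the function names and the test value $m = 10^4$ are assumptions for illustration.

```python
from math import exp, gamma, lgamma


def coefficient(m, alpha):
    """c_m = alpha * Gamma(m - alpha) / (m! * Gamma(2 - alpha)), 0 < alpha < 2, m >= 2."""
    return alpha * exp(lgamma(m - alpha) - lgamma(m + 1) - lgamma(2 - alpha))


def asymptote(m, alpha):
    """Right-hand side of (S10)."""
    return alpha / gamma(2 - alpha) * m ** (-(alpha + 1))


ratios = {a: coefficient(10**4, a) / asymptote(10**4, a) for a in (0.5, 1.0, 1.5)}
```

The ratios approach 1 as $m$ grows, with a relative deviation of order $1/m$.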



This shows that different Tsallis entropies $T_\alpha$ depend on different parts of the rarefaction
curve $S_m$. For $\alpha$ close to 2, the Tsallis entropy $T_\alpha$ is mainly determined by the rarefaction
curve for small $m$. For decreasing $\alpha$, the contribution of the rarefaction curve for large $m$
increases. For the limit cases $\alpha \to 0$ and $\alpha \to 2$ the constant of proportionality in (S10)
vanishes. For $\alpha = 2$ we have $T_2 = 1 - C = S_2 - 1$: the only contribution of the rarefaction
curve is at $m = 2$. For $\alpha = 0$ we have $T_0 = S - 1 = S_\infty - 1$: the contribution of the rarefaction
curve is entirely shifted to $m \to \infty$. This analysis also holds for the Hill diversities because
$D_\alpha$ is an increasing function of $T_\alpha$, see (S9).



As an illustration, we apply (S8) to a community with a power-law tail. That is, we consider
an artificial community consisting of an infinite number of species, for which the species are
arranged in decreasing order of abundance, and for which

$$p_i \sim i^{-z} \qquad \text{as } i \to \infty.$$

The abundances should be summable, so we have to impose that $z > 1$. The tail of the
abundance distribution determines the asymptotic behavior of the rarefaction curve,

$$S_m \sim m^{1/z} \qquad \text{as } m \to \infty.$$

From (S8) and (S10) it follows that the diversity $D_\alpha$ is finite for $\alpha > \frac{1}{z}$, and diverges for
$\alpha < \frac{1}{z}$. This can be checked directly from Definition (S5).

Text S4



Estimating species abundances from sample data 



The Good-Turing estimators (Good 1953) are a well-known family of frequency estimators.
Here we present a compact derivation, given in Nadas (1985), which demonstrates that the
Good-Turing estimators are non-parametric, that is, free of assumptions about the abundance
distribution.

Let $\Theta$ be a random variable taking values between 0 and 1, with a distribution function $G(\theta)$
about which nothing is known. Suppose that $R$ is another random variable whose conditional
distribution $p_M(r \mid \theta)$, when $\Theta$ has the value $\theta$, is binomial with parameters $M$ and $\theta$,

$$p_M(r \mid \theta) = \binom{M}{r}\, \theta^r (1 - \theta)^{M-r}. \tag{S11}$$

Then we have the identity

$$\theta\, p_M(r \mid \theta) = \frac{r+1}{M+1}\, p_{M+1}(r+1 \mid \theta). \tag{S12}$$



Suppose now that we wish to estimate the value of $\Theta$ given that $R$ is observed to take the
value $r$. Taking a Bayesian approach with prior distribution $G$, the posterior mean for $\Theta$ is

$$E[\Theta \mid R = r] = \frac{r+1}{M+1}\, \frac{p_{M+1}(r+1)}{p_M(r)}, \tag{S13}$$

where $p_M$ is the unconditional probability mass function of $R$ (that is, integrated out over
$G$). This derivation is non-parametric in that $G$ is not only unknown, but no assumptions
are made about $G$: the probability mass function $p_M$ must therefore be estimated directly
from the sample data, so that we are in fact performing empirical Bayes estimation.

In the context of diversity estimation, we regard $G$ as the community abundance distribution,
$\theta$ as the species abundance to be estimated and $r$ as the number of times that this species
occurs in the sample. We use the maximum likelihood estimates for $p_M(r)$ and $p_{M+1}(r+1)$
given by $F_r/M$ and $F_{r+1}/(M+1)$, respectively. Plugging the estimates into (S13) and
assuming that $M \gg 1$, we get the estimated community abundance $\hat{\theta}_r$ of a species observed
$r$ times in the sample,

$$\hat{\theta}_r = \frac{r+1}{M}\, \frac{F_{r+1}}{F_r}, \tag{S14}$$

which are the Good-Turing frequency estimators.



As a corollary of (S14) we get the estimator for the total abundance of the observed species,

$$\sum_{r \ge 1} F_r\, \hat{\theta}_r = \frac{1}{M} \sum_{r \ge 1} (r+1)\, F_{r+1} = \frac{M - F_1}{M},$$

so that the total abundance $P_{\mathrm{unobs}}$ of the unobserved species is estimated as

$$\hat{P}_{\mathrm{unobs}} = \frac{F_1}{M}. \tag{S15}$$

In words, the total relative abundance of unobserved species in the community is estimated
as the total relative abundance of singletons in the sample.
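
The estimators (S14) and (S15) are simple enough to state in a few lines of Python. In this sketch the function names and the dictionary layout (r mapped to $F_r$) are illustrative assumptions; the values $M = 7068$ and $F_1 = 311$ are those reported for the upper ocean sample in Table S2.

```python
def good_turing_abundance(freq_counts, r):
    """Estimated community abundance (S14) of a species observed r times.

    freq_counts maps r -> F_r; requires F_r > 0.
    """
    M = sum(k * f for k, f in freq_counts.items())  # sample size
    return (r + 1) / M * freq_counts.get(r + 1, 0) / freq_counts[r]


def unobserved_abundance(f1, M):
    """Estimated total relative abundance of unobserved species (S15)."""
    return f1 / M


# Upper ocean sample of Table S2: M = 7068 individuals, F_1 = 311 singletons.
p_unobs_upper_ocean = unobserved_abundance(311, 7068)

# Hypothetical toy counts (not from the paper) to exercise (S14).
theta_1 = good_turing_abundance({1: 3, 2: 2, 5: 1}, 1)
```

The upper ocean value rounds to 0.044, matching the entry in Table S2. Note that (S14) returns zero whenever $F_{r+1} = 0$, one reason the raw estimates are often smoothed in practice.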

Text S5



Estimating Hill diversities from sample data 



We construct estimators for the Hill diversity $D_\alpha$ based on a sample of size $M$ taken from
the community. Our strategy consists in first estimating the rarefaction curve $S_m$ and then
using the link (S8) between $D_\alpha$ and $S_m$.

The estimation of the rarefaction curve decomposes into two parts. For the part $m \le M$ the
rarefaction curve can be estimated unbiasedly using the estimator (S2). For the part $m > M$
the sample data have to be extrapolated, and no unbiased estimator exists. We denote the
relative abundances of the unobserved species by $q_1, q_2, \ldots$ (there are $S - S_{\mathrm{obs}}$ unobserved
species). If we knew the abundances $q_i$, then we could compute the rarefaction curve using
the formula,

$$\hat{S}_m = S_{\mathrm{obs}} + \sum_{i \ge 1} \left(1 - (1 - q_i)^{m-M}\right), \qquad m > M. \tag{S16}$$

As we have argued in the main text, the sample data contain little information about the
abundances of unobserved species. However, the Good-Turing estimator (S15) for the total
abundance $P_{\mathrm{unobs}} = \sum_{i \ge 1} q_i$ of the unobserved species is available. It follows from (S16) that
the estimation of the rarefaction curve $S_m$ for $m > M$ reduces to distributing the estimated
abundance $\hat{P}_{\mathrm{unobs}}$ over the individual unobserved species.

We work out two scenarios, see Figure 3 of the main text. In the first scenario we distribute
$\hat{P}_{\mathrm{unobs}}$ so as to obtain the lowest possible value of the diversity $D_\alpha$ consistent with the sample
data. By this we mean that $\hat{P}_{\mathrm{unobs}}$ must be distributed in a manner which remains consistent
with the estimates $\hat{\theta}_r$. The lowest diversity occurs when all unobserved species have the same
abundance, $q_1 = q_2 = \ldots = q^-$, and this abundance is as high as possible. However, as
noted in Good (1953), the frequency estimates $\hat{\theta}_r$ must increase as $r$ increases: this implies
an upper bound for $q^-$, namely $\hat{\theta}_1$ (which is the estimated community abundance of any
species observed exactly once in the sample). We therefore take $q^- = \hat{\theta}_1 = \frac{2 F_2}{M F_1}$ so that,
from (S15), there are $\frac{\hat{P}_{\mathrm{unobs}}}{q^-} = \frac{F_1^2}{2 F_2}$ unobserved species. Hence, the estimated rarefaction curve (S16)
becomes

$$\hat{S}_m^- = S_{\mathrm{obs}} + \frac{F_1^2}{2 F_2} \left(1 - \left(1 - \frac{2 F_2}{M F_1}\right)^{m-M}\right), \qquad m > M, \tag{S17}$$

where the superscript in $\hat{S}_m^-$ indicates the low-diversity scenario.



In the second scenario we distribute $\hat{P}_{\mathrm{unobs}}$ so as to obtain the highest possible value of the
diversity $D_\alpha$. The highest diversity is obtained when all unobserved species have the same
abundance, $q_1 = q_2 = \ldots = q^+$, and this abundance is as small as possible. The smallest
abundance a species can have in a community of size $N$ is equal to $\frac{1}{N}$, corresponding to a
species represented by a single individual. We therefore take $q^+ = \frac{1}{N}$ so that, from (S15),
there are $\frac{N F_1}{M}$ unobserved species. Hence, the estimated rarefaction curve (S16) becomes

$$\hat{S}_m^+ = S_{\mathrm{obs}} + \frac{N F_1}{M} \left(1 - \left(1 - \frac{1}{N}\right)^{m-M}\right), \qquad m > M, \tag{S18}$$

where the superscript in $\hat{S}_m^+$ indicates the high-diversity scenario. Note that the upper
estimator (S18) depends on the community size $N$, in contrast to the estimator (S17).
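
The two extrapolations (S17) and (S18) can be sketched in Python as follows. This is an illustrative sketch, not the Matlab code referred to in the Supplementary Information; the function names and the toy counts are assumptions. The term $1 - (1 - q)^{m-M}$ is evaluated with `expm1` and `log1p` so that it remains accurate when $q = 1/N$ is far below machine precision relative to 1.

```python
from math import expm1, log1p


def _fraction_seen(q, extra):
    """1 - (1 - q)**extra, computed stably for very small q."""
    return -expm1(extra * log1p(-q))


def extrapolate_lower(m, M, s_obs, f1, f2):
    """Low-diversity extrapolation (S17); requires m > M, f1 > 0 and f2 > 0."""
    n_unobs = f1 ** 2 / (2 * f2)     # estimated number of unobserved species
    q = 2 * f2 / (M * f1)            # their common abundance, theta_1
    return s_obs + n_unobs * _fraction_seen(q, m - M)


def extrapolate_upper(m, M, s_obs, f1, N):
    """High-diversity extrapolation (S18); requires m > M."""
    n_unobs = N * f1 / M             # unobserved species of abundance 1/N each
    return s_obs + n_unobs * _fraction_seen(1.0 / N, m - M)


# Hypothetical toy sample: M = 12, S_obs = 6, F_1 = 3, F_2 = 2, N = 10^9.
lower_far = extrapolate_lower(10**6, 12, 6, 3, 2)
upper_far = extrapolate_upper(10**6, 12, 6, 3, 10**9)
```

Both curves start from $S_{\mathrm{obs}} + F_1/M$ at $m = M + 1$ and then separate: the lower curve saturates at $S_{\mathrm{obs}} + F_1^2/(2F_2)$, while the upper curve keeps growing until $m$ becomes comparable to $N$.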



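To summarize, we have obtained two estimators for the Hill diversity $D_\alpha$, a lower estimate
$\hat{D}_\alpha^-$ and an upper estimate $\hat{D}_\alpha^+$. They can be computed as follows:

Lower estimate: First, compute the lower estimate of the rarefaction curve. From (S2) and
(S17),

$$\hat{S}_m^- = \begin{cases}
\displaystyle \sum_{k \ge 1} F_k \left(1 - \frac{\binom{M-k}{m}}{\binom{M}{m}}\right) & \text{if } m = 1, 2, \ldots, M \\[2ex]
\displaystyle S_{\mathrm{obs}} + \frac{F_1^2}{2 F_2} \left(1 - \left(1 - \frac{2 F_2}{M F_1}\right)^{m-M}\right) & \text{if } m = M+1, M+2, \ldots
\end{cases} \tag{S19}$$

Then, substitute this result into (S8) to estimate the Hill diversity,

$$\hat{D}_\alpha^- = \left(\sum_{m=1}^{\infty} \frac{\alpha\, \Gamma(m-\alpha)}{m!\, \Gamma(1-\alpha)}\, \hat{S}_m^-\right)^{\frac{1}{1-\alpha}}. \tag{S20}$$

Upper estimate: First, compute the upper estimate of the rarefaction curve. From (S2)
and (S18),

$$\hat{S}_m^+ = \begin{cases}
\displaystyle \sum_{k \ge 1} F_k \left(1 - \frac{\binom{M-k}{m}}{\binom{M}{m}}\right) & \text{if } m = 1, 2, \ldots, M \\[2ex]
\displaystyle S_{\mathrm{obs}} + \frac{N F_1}{M} \left(1 - \left(1 - \frac{1}{N}\right)^{m-M}\right) & \text{if } m = M+1, M+2, \ldots
\end{cases} \tag{S21}$$

Then, substitute this result into (S8) to estimate the Hill diversity,

$$\hat{D}_\alpha^+ = \left(\sum_{m=1}^{\infty} \frac{\alpha\, \Gamma(m-\alpha)}{m!\, \Gamma(1-\alpha)}\, \hat{S}_m^+\right)^{\frac{1}{1-\alpha}}. \tag{S22}$$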



The Matlab code to compute the Hill diversity estimates $\hat{D}_\alpha^-$ and $\hat{D}_\alpha^+$ is part of the
Supplementary Information.

We discuss three properties of the estimators $\hat{D}_\alpha^-$ and $\hat{D}_\alpha^+$ that follow directly from their
definitions. First, the lower estimate $\hat{D}_\alpha^-$ generalizes Chao's estimator for species richness,

$$\hat{D}_0^- = S_{\mathrm{obs}} + \frac{F_1^2}{2 F_2} = \hat{S}_{\mathrm{Chao}}.$$

Note that the lower estimate, like Chao's estimator, only gives meaningful results if the
number of species observed once or twice in the sample is sufficiently large, and at least
$F_2 > 0$. These conditions are typically satisfied in practice, especially for highly diverse
communities.
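
For reference, Chao's estimator and the condition $F_2 > 0$ can be written as a short Python function. The function name and the toy counts are assumptions for illustration.

```python
def chao_estimate(s_obs, f1, f2):
    """Chao's lower estimate of species richness, S_obs + F_1^2 / (2 F_2)."""
    if f2 <= 0:
        # The estimator is undefined without doubletons (F_2 = 0).
        raise ValueError("Chao's estimator requires F_2 > 0")
    return s_obs + f1 ** 2 / (2 * f2)


# Hypothetical toy sample: 6 observed species, 3 singletons, 2 doubletons.
s_chao = chao_estimate(6, 3, 2)
```

Raising an error for $F_2 = 0$ is one possible design choice; it makes the failure explicit instead of returning an infinite or arbitrary value.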

Second, the upper estimate $\hat{D}_\alpha^+$ depends on community size $N$, which is typically several
orders of magnitude larger than sample size $M$. It is therefore instructive to consider the
limit $N \to \infty$. A computation analogous to the one in Text S2 shows that the upper estimate
diverges as $N^{1-\alpha}$ for $\alpha < 1$, and as $\log N$ for $\alpha = 1$. Hence, we expect large values of
the upper estimate (and therefore large estimation uncertainty) for $\alpha < 1$, especially for $\alpha$
close to zero (that is, close to species richness).

Third, the estimators $\hat{D}_\alpha^-$ and $\hat{D}_\alpha^+$ coincide for the Simpson diversity. The Simpson diversity
$D_2$ is the only Hill diversity $D_\alpha$ that does not depend on the extrapolation of the rarefaction
curve. It is a function of the rarefaction curve at $m = 2$: $D_2 = \frac{1}{2 - S_2}$. Because the initial
part of the estimated rarefaction curve is the same for the lower and upper estimate, the
Simpson diversity estimates are equal, $\hat{D}_2^- = \hat{D}_2^+$. The Simpson diversity is not sensitive to
the extrapolation of the rarefaction curve, and therefore easy to estimate.
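
Because only $\hat{S}_2$ is needed, the Simpson diversity estimate takes a few lines of Python. The sketch below combines (S2) at $m = 2$ with $D_2 = 1/(2 - S_2)$; the function name and the toy counts are assumptions for illustration.

```python
from math import comb


def simpson_diversity_estimate(freq_counts):
    """Estimate D_2 = 1 / (2 - S_2), with S_2 estimated by (S2) at m = 2.

    freq_counts maps k -> F_k; requires a sample size M >= 2 and at least
    one species observed more than once (otherwise 2 - S_2 is zero).
    """
    M = sum(k * f for k, f in freq_counts.items())
    pairs = comb(M, 2)
    s2 = sum(f * (1 - comb(M - k, 2) / pairs) for k, f in freq_counts.items())
    return 1.0 / (2.0 - s2)


# Hypothetical toy sample: 3 singletons, 2 doubletons, 1 species seen 5 times.
d2_hat = simpson_diversity_estimate({1: 3, 2: 2, 5: 1})
```

Algebraically, $2 - \hat{S}_2 = \sum_k F_k\, k(k-1) / (M(M-1))$, so this estimate does not involve the singleton count at all, which is consistent with its insensitivity to the rare species tail.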

Supplementary Tables

Table S1

Table S1: Description of communities used in Figure 2. Communities C1, C2 and C3 have a
power-law abundance distribution, with parameters $S$, the number of species in the community,
and $z$, the exponent of the power-law. The Hill diversity of order $\alpha = 0$ is equal to the
number of species, $D_0 = S$; the Hill diversity of order $\alpha = 1$ is the Shannon diversity; the
Hill diversity of order $\alpha = 2$ is the Simpson diversity. For a sample of size 2 10^, the number
of observed species is denoted by $S_{\mathrm{obs}}$ and Chao's estimator for species richness is denoted
by $\hat{S}_{\mathrm{Chao}}$.



5~L(? O 640 35 4.8 lO'* 1.5 10'' 
2 10^ 1.3 100 11 2.4 103 8.3 10^ 
10*^ 1.6 15 4.5 690 1.8 10^ 



community C1
community C2 
community C3 

Table S2

Table S2: Data for empirically-sampled microbial communities. We report the sample size
$M$, the number of species observed in the sample $S_{\mathrm{obs}}$, the number of singleton species $F_1$,
that is, the number of species that have been sampled only once, the estimated relative
abundance of the unobserved species $\hat{P}_{\mathrm{unobs}}$, and the Chao estimate $\hat{S}_{\mathrm{Chao}}$ for the number
of species in the community. The data sets are taken from Quince et al. (2008): a seawater
bacterial sample from the upper ocean (Rusch et al. 2007), soil bacterial samples at four
locations: Brazil, Florida, Illinois and Canada (Roesch et al. 2007), and seawater samples
from deep-sea vents at two locations: FS312 and FS396, separated into bacteria and archaea
(Huber et al. 2007).

                     M         S_obs     F_1      P_unobs    S_Chao
upper ocean          7068      811       311      0.044      1038
soil, Brazil         26079     2880      1176     0.045      4604
soil, Florida        28150     3440      1541     0.055      5643
soil, Illinois       31621     3357      1466     0.046      5745
soil, Canada         52773     5515      2634     0.050      10394
FS312, bacteria      442062    12183     5339     0.012      19568
FS312, archaea       200199    1594      460      0.002      2175
FS396, bacteria      247826    5843      2825     0.011      10570
FS396, archaea       16428     418       158      0.010      630

Supplementary Figures

Figure S1

Figure S1: Sample data are insensitive to rare species tail of community. We generated
three community abundance distributions, shown in red, blue and green (panels a-c). The 
three communities have the same abundance distribution for species with relative abundance 
above 10^^ (the part of the rank-abundance curve to the left of the dashed black line). This 
common part consists of 6 10'^ species, occupying 99% of the community abundance. The 
communities differ in the tail of rare species: the community in panel a has 1.6 10^ species; 
the community in panel b has 1.6 10^ species; the community in panel c has 10^ species. 
Despite the marked differences, the rarefaction curves of the three communities up to sample 
size 2 10"* are identical (see panel d). This observation holds generally: any set of rare species 
leads to the same rarefaction curve if each rare species has relative abundance below 10~^ 
and the total relative abundance of the set of rare species equals 0.01. 

Figure S2

Figure S2: Hill diversity for large $\alpha$ is insensitive to rare species tail. Panel a: We computed
the Hill diversity $D_\alpha$ for the three communities of Figure S1. The Hill diversities for $\alpha > 1$
almost coincide because the communities have the same set of non-rare species. The Hill
diversities for $\alpha < 1$ differ because the communities have different rare species tails. Panel b:
We computed the Hill diversity $D_\alpha$ for the three communities of Figure 2. The curves of
Hill diversities intersect. For small $\alpha$, the most species-rich community (C3, green) has the
largest Hill diversity, and the most species-poor community (C1, red) has the smallest Hill
diversity. For larger $\alpha$, the most even community (C1, red) has the largest Hill diversity, and
the most uneven community (C3, green) has the smallest Hill diversity.

Figure S3

Figure S3: Rank-abundance curves of empirical microbial community samples. Relative
abundance in the sample is plotted against species rank in the sample. We used the same
data sets as Quince et al. (2008): a seawater bacterial sample from the upper ocean (Rusch
et al. 2007), soil bacterial samples at four locations: Brazil, Florida, Illinois and Canada
(Roesch et al. 2007), and seawater samples from deep-sea vents at two locations: FS312 and
FS396, separated into bacteria and archaea (Huber et al. 2007).

Figure S4

[Figure: panels titled upper ocean; soil, Brazil; soil, Florida; soil, Illinois; soil, Canada; FS312, bacteria; FS312, archaea; FS396, bacteria; FS396, archaea. Horizontal axis: Hill parameter $\alpha$, with ticks at 0.5, 1, 1.5 and 2.]

Figure S4: Community-size dependence of Hill diversity estimates. Same data sets as in
Figure 5, but for three values of community size $N$. The lower estimate is independent of
$N$; the upper estimate increases with increasing $N$ (from left to right: N = 10^°, N = 10^^,
N = 10^°). We observe the same behavior as for the in silico generated data sets of Figure 4.